Abstract
The given FASTQ files (AH_S1 and CH_S2) were analyzed by a standard analytical pipeline to map the reads to a reference, hg19. The datasets were firstly compared based on the results of FastQC following skewer. The GATK Picard resulted in aligned BAM files by sequencing alignment and several post-processing including MarkDuplicates. Lastly, the variations were called by Mutect2.
Figure 1. Schematic of AH_S1 and CH_S2 analysis workflow
The quality of the two datasets were measured by FastQC after trimming adapter sequences from raw FASTQ files. Two files per each dataset were processed and the outputs were summarized in Figure 2. Among 12 QC categories, AH_S1 fails in 4 categories with 1 warning and CH_S2 fails in 2 categories with 3 warnings.
Figure 2. Summary of FastQC results
The output BAM files were analyzed by Picard CollectTargetedPcrMetrics to calculate the PCR based metrics for targeted regions (Table 1)
| AH_S1 | CH_S2 | |
|---|---|---|
| CUSTOM_AMPLICON_SET | AH_S1_target | CH_S2_target |
| GENOME_SIZE | 3101804739 | 3101804739 |
| AMPLICON_TERRITORY | 23498 | 21282 |
| TARGET_TERRITORY | 23498 | 21282 |
| TOTAL_READS | 2000000 | 2000000 |
| PF_READS | 2000000 | 2000000 |
| PF_BASES | 300000000 | 292610263 |
| PF_UNIQUE_READS | 20892 | 1155004 |
| PCT_PF_READS | 1 | 1 |
| PCT_PF_UQ_READS | 0.010446 | 0.577502 |
| PF_UQ_READS_ALIGNED | 14403 | 1150635 |
| PF_SELECTED_PAIRS | 948114 | 772114 |
| PF_SELECTED_UNIQUE_PAIRS | 4029 | 403052 |
| PCT_PF_UQ_READS_ALIGNED | 0.689403 | 0.996217 |
| PF_BASES_ALIGNED | 283454480 | 285805025 |
| PF_UQ_BASES_ALIGNED | 1812176 | 163630869 |
| ON_AMPLICON_BASES | 195086105 | 85441202 |
| NEAR_AMPLICON_BASES | 75600787 | 137948372 |
| OFF_AMPLICON_BASES | 12767588 | 62415451 |
| ON_TARGET_BASES | 719566 | 42691658 |
| ON_TARGET_FROM_PAIR_BASES | 713570 | 42639040 |
| PCT_AMPLIFIED_BASES | 0.954957 | 0.781615 |
| PCT_OFF_AMPLICON | 0.045043 | 0.218385 |
| ON_AMPLICON_VS_SELECTED | 0.720708 | 0.382476 |
| MEAN_AMPLICON_COVERAGE | 8302.243 | 4014.717 |
| MEAN_TARGET_COVERAGE | 30.62244 | 2005.998 |
| MEDIAN_TARGET_COVERAGE | 25 | 200 |
| FOLD_ENRICHMENT | 90850.34 | 43571.2 |
| ZERO_CVG_TARGETS_PCT | 0 | 0 |
| PCT_EXC_DUPE | 0.993607 | 0.427474 |
| PCT_EXC_MAPQ | 0.000362 | 0.011102 |
| PCT_EXC_BASEQ | 0 | 7.2e-05 |
| PCT_EXC_OVERLAP | 0 | 0 |
| PCT_EXC_OFF_TARGET | 0.006004 | 0.5595 |
| FOLD_80_BASE_PENALTY | 1.80132 | 10.02999 |
| PCT_TARGET_BASES_1X | 0.999957 | 1 |
| PCT_TARGET_BASES_2X | 0.999957 | 1 |
| PCT_TARGET_BASES_10X | 0.98336 | 1 |
| PCT_TARGET_BASES_20X | 0.695463 | 1 |
| PCT_TARGET_BASES_30X | 0.354328 | 1 |
| AT_DROPOUT | 12.71635 | 4.323885 |
| GC_DROPOUT | 1.916709 | 0.544665 |
| HET_SNP_SENSITIVITY | 0.995386 | 1 |
| HET_SNP_Q | 23 | -1 |
Figure 3. Histogram of depths of AH_S1 and CH_S2
Figure 4. Sequencing depth of common regions of AH_S1 and CH_S2
Those aligned BAM files were used for variant calling compared with hg19 reference as normal tissue sequecing dataset was not given. Mutect2 identified 101 and 219 variation locations for AH_S1 and CH_S2 respectively and 25 of them is common (Figure 5). The common variant calls were annotated by VEP and Table 2 highlights nonsynonymous mutations involving mutational hotspots or Cancer Hotspots.
Figure 5. Overlaps of variant calls between AH_S1 and CH_S2 samples
| Location | Allele | Consequence | IMPACT | SYMBOL | Protein_position | Amino_acids |
|---|---|---|---|---|---|---|
| 1:115256530-115256530 | T | missense_variant | MODERATE | NRAS | 61 | Q/K |
| 12:25398281-25398281 | T | missense_variant | MODERATE | KRAS | 13 | G/D |
| 15:66727451-66727451 | C | missense_variant | MODERATE | MAP2K1 | 56 | Q/P |
| 15:66729147-66729147 | T | missense_variant | MODERATE | MAP2K1 | 119 | H/Y |
| 15:90631917-90631918 |
|
frameshift_variant | HIGH | IDH2 | 145 | G/X |
| 15:90631934-90631934 | T | missense_variant | MODERATE | IDH2 | 140 | R/Q |
| 17:7578388-7578388 | T | frameshift_variant | HIGH | TP53 | 181 | R/QX |
| 17:7579472-7579472 | C | missense_variant | MODERATE | TP53 | 72 | P/R |
| 3:178936082-178936082 | A | missense_variant | MODERATE | PIK3CA | 542 | E/K |
| 3:178952085-178952085 | G | missense_variant | MODERATE | PIK3CA | 1047 | H/R |
| 3:41266101-41266101 | A | missense_variant | MODERATE | CTNNB1 | 33 | S/Y |
| 3:41266133-41266136 |
|
inframe_deletion | MODERATE | CTNNB1 | 44-45 | PS/P |
| 7:140453136-140453136 | C | missense_variant | MODERATE | BRAF | 600 | V/G |
| 7:55241707-55241707 | A | missense_variant | MODERATE | EGFR | 719 | G/S |
| 9:80409488-80409488 | A | missense_variant | MODERATE | GNAQ | 209 | Q/L |
This report shows the workflow to analyze two different NGS library sequencing results. Overall, CH_S2 could be a better choice than AH_S1 considering the sequecing quality, consistency of read depth and also extent of variation calls.
The Rmd file of this report and source codes are available in my repository github.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] grid parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 VennDiagram_1.6.20 futile.logger_1.4.3
## [4] vcfR_1.12.0 kableExtra_1.3.1 ngsReports_1.2.0
## [7] tibble_3.0.4 ggplot2_3.3.2 BiocGenerics_0.32.0
## [10] BiocStyle_2.14.4
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-143 bitops_1.0-6
## [3] matrixStats_0.57.0 lubridate_1.7.9.2
## [5] webshot_0.5.2 httr_1.4.2
## [7] GenomeInfoDb_1.22.1 tools_3.6.1
## [9] vegan_2.5-6 R6_2.5.0
## [11] mgcv_1.8-31 lazyeval_0.2.2
## [13] colorspace_2.0-0 permute_0.9-5
## [15] withr_2.3.0 tidyselect_1.1.0
## [17] compiler_3.6.1 rvest_0.3.6
## [19] Biobase_2.46.0 formatR_1.7
## [21] flashClust_1.01-2 xml2_1.3.2
## [23] DelayedArray_0.12.3 plotly_4.9.2.1
## [25] ggdendro_0.1.22 scales_1.1.1
## [27] stringr_1.4.0 digest_0.6.27
## [29] Rsamtools_2.2.3 rmarkdown_2.5
## [31] XVector_0.26.0 jpeg_0.1-8.1
## [33] pkgconfig_2.0.3 htmltools_0.5.0
## [35] FactoMineR_2.3 highr_0.8
## [37] htmlwidgets_1.5.2 rlang_0.4.8
## [39] rstudioapi_0.13 visNetwork_2.0.9
## [41] generics_0.1.0 farver_2.0.3
## [43] zoo_1.8-8 hwriter_1.3.2
## [45] jsonlite_1.7.1 BiocParallel_1.20.1
## [47] dplyr_1.0.2 RCurl_1.98-1.2
## [49] magrittr_2.0.1 GenomeInfoDbData_1.2.2
## [51] leaps_3.1 Matrix_1.2-18
## [53] Rcpp_1.0.5 munsell_0.5.0
## [55] S4Vectors_0.24.4 ape_5.4-1
## [57] lifecycle_0.2.0 scatterplot3d_0.3-41
## [59] stringi_1.4.6 yaml_2.2.1
## [61] MASS_7.3-51.5 SummarizedExperiment_1.16.1
## [63] zlibbioc_1.32.0 plyr_1.8.6
## [65] pinfsc50_1.2.0 ggrepel_0.8.2
## [67] crayon_1.3.4 lattice_0.20-38
## [69] splines_3.6.1 Biostrings_2.54.0
## [71] pander_0.6.3 knitr_1.30
## [73] pillar_1.4.7 GenomicRanges_1.38.0
## [75] reshape2_1.4.4 stats4_3.6.1
## [77] futile.options_1.0.1 glue_1.4.2
## [79] evaluate_0.14 ShortRead_1.44.3
## [81] latticeExtra_0.6-29 memuse_4.1-0
## [83] lambda.r_1.2.4 data.table_1.13.2
## [85] BiocManager_1.30.10 vctrs_0.3.4
## [87] png_0.1-7 gtable_0.3.0
## [89] purrr_0.3.4 tidyr_1.1.2
## [91] xfun_0.19 viridisLite_0.3.0
## [93] truncnorm_1.0-8 GenomicAlignments_1.22.1
## [95] IRanges_2.20.2 cluster_2.1.0
## [97] DiagrammeR_1.0.6.1 ellipsis_0.3.1